Extend check namespace usage #38347

kevinetienne · 2020-12-07T21:36:39Z

xref Extend check for inconsistent pandas namespace usage #38093
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

TLDR: This code changes might be not in a mergeable state as false positives remain when updating the pattern.

The change is a bit bigger to what is described in #38093 as more files needed to be fixed, which is fine I think.

The stack displaying the files to fix might be limited to a few items (9 items in my case) when running the command pre-commit run inconsistent-namespace-usage --all-files. Every time a set of files was fixed a few others appeared which is why I ended fixing up more files and why my list of false positive is not exhaustive.

I have committed the change to the PATTERN in a separate commit as we have false positives. Such as the following files which contain:

a similar function in the numpy package (ex: np.array see

pandas/pandas/tests/test_strings.py

Line 188 in 862cd05

values = np.array(values, dtype=object) # object dtype to avoid casting

)
a similar function in the pyarrow package (ex: pyarrow.array see

pandas/pandas/tests/io/test_parquet.py

Line 839 in 862cd05

"a": pyarrow.array([1, 2, 3, None], "int64"),

)
a function can be called on dataframe or a serie (ex: df.isna and aggr.isna see

pandas/pandas/tests/groupby/test_categorical.py

Line 528 in 862cd05

aggr[aggr.isna()] = "missing"

and

pandas/pandas/tests/groupby/test_categorical.py

Line 1570 in 862cd05

result = df.groupby(["A", "B"]).agg(lambda df: df.isna().sum())
or a function is called with any prefixes (see the few calls to factorize in

pandas/pandas/tests/test_algos.py

Line 52 in 862cd05

result_codes, result_uniques = obj.factorize(sort=sort)

)
eval vs pd.eval (see https://github.com/pandas-dev/pandas/blob/862cd05df4452592a99dd1a4fa10ce8cfb3766f7/pandas/tests/computation/test_eval.py)

I'm probably missing a few other cases but 1) 2) and 3) can be added in the negative lookahead (ignore function/methods which start with np/pyarrow/df). However the regex might grow bigger and bigger. 5) can be ignored entirely either within the regex or with an exclusion list.

Let me know you're thinking as I see three solutions:

update the regex and maybe add an exclusion list (for case like eval) and see where that leads
remove last commit for PATTERN and create another ticket to address the other cases
close the ticket and find/investigate another solution (ex: separate class vs function check/check import instead...)

MarcoGorelli · 2020-12-08T19:21:42Z

eval vs pd.eval

I hadn't thought of that...not entirely sure what the best way to go about this is but I'll get back to you on this

MarcoGorelli · 2020-12-16T14:03:21Z

Hey @kevinetienne

Sorry this taken me a bit to get back to you on. I've looked into it further and a few suggestions:

let's make a blocklist in which to put the (few) cases we want to ignore, such as eval and pd.eval
instead of checking that the pattern doesn't start with pd., let's just check it doesn't start with .

I think that applying the following diff should get us a good way there:

diff --git a/scripts/check_for_inconsistent_pandas_namespace.py b/scripts/check_for_inconsistent_pandas_namespace.py
index b213d931e7..908a056fdc 100644
--- a/scripts/check_for_inconsistent_pandas_namespace.py
+++ b/scripts/check_for_inconsistent_pandas_namespace.py
@@ -16,20 +16,20 @@ from typing import Optional, Sequence
 
 PATTERN = r"""
     (
-        (?<!pd\.)(?<!\w)    # check class_name doesn't start with pd. or character
-        ([A-Z]\w+)\(        # match DataFrame but not pd.DataFrame or tm.makeDataFrame
-        .*                  # match anything
-        pd\.\2\(            # only match e.g. pd.DataFrame
+        (?<![\.\w])    # check class_name doesn't start with . or character
+        (\w+)\(        # match DataFrame but not pd.DataFrame or tm.makeDataFrame
+        .*             # match anything
+        pd\.\2\(       # only match e.g. pd.DataFrame
     )|
     (
-        pd\.([A-Z]\w+)\(    # only match e.g. pd.DataFrame
-        .*                  # match anything
-        (?<!pd\.)(?<!\w)    # check class_name doesn't start with pd. or character
-        \4\(                # match DataFrame but not pd.DataFrame or tm.makeDataFrame
+        pd\.(\w+)\(    # only match e.g. pd.DataFrame
+        .*             # match anything
+        (?<![\.\w])    # check class_name doesn't start with . or character
+        \4\(           # match DataFrame but not pd.DataFrame or tm.makeDataFrame
     )
     """
 ERROR_MESSAGE = "Found both `pd.{class_name}` and `{class_name}` in {path}"
-
+BLOCK_LIST = {'eval'}
 
 def main(argv: Optional[Sequence[str]] = None) -> None:
     parser = argparse.ArgumentParser()
@@ -45,11 +45,11 @@ def main(argv: Optional[Sequence[str]] = None) -> None:
         match = pattern.search(contents)
         if match is None:
             continue
-        if match.group(2) is not None:
+        if match.group(2) is not None and match.group(2).decode() not in BLOCK_LIST:
             raise AssertionError(
                 ERROR_MESSAGE.format(class_name=match.group(2).decode(), path=str(path))
             )
-        if match.group(4) is not None:
+        if match.group(4) is not None and match.group(4).decode() not in BLOCK_LIST:
             raise AssertionError(
                 ERROR_MESSAGE.format(class_name=match.group(4).decode(), path=str(path))
             )

Also, there's a few conflicts, could you please fetch and merge upstream/master?

…8347

jreback · 2021-01-03T17:22:57Z

@kevinetienne pls merge master. this will need to be merged soon after otherwise likely will have further conflicts. ping on green.

jreback · 2021-01-03T17:32:28Z

@kevinetienne would prefer that you limit scope to this to a single type of change (and instead use multiple PRs) as large PRs are very likley to take a long time to review (unless they are trivial changes but that is not the case here)

kevinetienne · 2021-01-04T09:32:19Z

I'm closing this PR and I'll update the main ticket later. I'll break this PR in several parts. It would be easier for me to deliver too.

kevinetienne and others added 2 commits December 7, 2020 21:22

Update inconsistent pandas namespace

4f2b48b

Update namespace check to check for function

078339e

MarcoGorelli self-requested a review December 8, 2020 15:03

MarcoGorelli added 7 commits January 3, 2021 11:12

catch more cases

be1f67d

Merge remote-tracking branch 'upstream/master' into pr/kevinetienne/3…

21edd76

…8347

fixup extra failing tests from merge

a2aede0

import isna and read_csv

764679d

fixup isna

0d05638

fixup regex match in test_integrity

902d386

dont use array fixture

b529548

jreback added the Code Style Code style, linting, code_checks label Jan 3, 2021

kevinetienne closed this Jan 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend check namespace usage #38347

Extend check namespace usage #38347

kevinetienne commented Dec 7, 2020 •

edited by jreback

Loading

MarcoGorelli commented Dec 8, 2020

MarcoGorelli commented Dec 16, 2020 •

edited

Loading

jreback commented Jan 3, 2021

jreback commented Jan 3, 2021

kevinetienne commented Jan 4, 2021

Extend check namespace usage #38347

Extend check namespace usage #38347

Conversation

kevinetienne commented Dec 7, 2020 • edited by jreback Loading

MarcoGorelli commented Dec 8, 2020

MarcoGorelli commented Dec 16, 2020 • edited Loading

jreback commented Jan 3, 2021

jreback commented Jan 3, 2021

kevinetienne commented Jan 4, 2021

kevinetienne commented Dec 7, 2020 •

edited by jreback

Loading

MarcoGorelli commented Dec 16, 2020 •

edited

Loading